For high school students who excel in mathematics, the International Mathematical Olympiad (IMO) is the top international competition. The event, which was first held in Romania in 1959, has expanded significantly from the seven original participating nations to include more than 100 nations from five continents. The IMO, held annually in different host countries, presents competitors with six complex mathematical problems—three per day—spread over two days.
This dataset originates from the International Mathematical Olympiad and contains detailed information about participating countries. It includes the gender composition of each team, the team's score on each problem of the competition (p1 to p6), the number of gold, silver, and bronze awards won, the count of honorable mentions received by each country, and the names of the team leader and deputy leader.
The goal of this study is to determine the most effective clustering technique for examining the performance of participating countries over the decades. The objective is to identify distinct patterns or groupings in the data that reflect historical and comparative performance across different time periods and geographic regions, thus providing insight into how mathematical proficiency has evolved globally.
dim(df)
## [1] 3780 18
head(df)
## # A tibble: 6 × 18
## year country team_size_all team_size_male team_size_female p1 p2 p3
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2024 United … 6 5 1 42 41 19
## 2 2024 People'… 6 6 0 42 42 31
## 3 2024 Republi… 6 6 0 42 37 18
## 4 2024 India 6 6 0 42 34 11
## 5 2024 Belarus 6 6 0 42 30 10
## 6 2024 Singapo… 6 6 0 42 37 7
## # ℹ 10 more variables: p4 <dbl>, p5 <dbl>, p6 <dbl>, p7 <lgl>,
## # awards_gold <dbl>, awards_silver <dbl>, awards_bronze <dbl>,
## # awards_honorable_mentions <dbl>, leader <chr>, deputy_leader <chr>
summary(df)
## year country team_size_all team_size_male
## Min. :1959 Length:3780 Min. :1.000 Min. :0.0
## 1st Qu.:1995 Class :character 1st Qu.:6.000 1st Qu.:5.0
## Median :2006 Mode :character Median :6.000 Median :6.0
## Mean :2004 Mean :5.742 Mean :5.2
## 3rd Qu.:2016 3rd Qu.:6.000 3rd Qu.:6.0
## Max. :2024 Max. :8.000 Max. :8.0
## NA's :283
## team_size_female p1 p2 p3
## Min. :0.000 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.:1.000 1st Qu.:12.00 1st Qu.: 3.25 1st Qu.: 0.000
## Median :1.000 Median :26.00 Median :12.00 Median : 2.000
## Mean :1.066 Mean :24.74 Mean :15.44 Mean : 6.958
## 3rd Qu.:1.000 3rd Qu.:38.00 3rd Qu.:26.00 3rd Qu.:10.000
## Max. :6.000 Max. :56.00 Max. :56.00 Max. :64.000
## NA's :2180 NA's :110 NA's :110 NA's :110
## p4 p5 p6 p7
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Mode:logical
## 1st Qu.:10.00 1st Qu.: 2.00 1st Qu.: 0.000 NA's:3780
## Median :23.00 Median :10.00 Median : 1.000
## Mean :23.01 Mean :14.09 Mean : 5.698
## 3rd Qu.:36.00 3rd Qu.:23.00 3rd Qu.: 7.000
## Max. :56.00 Max. :56.00 Max. :63.000
## NA's :110 NA's :110 NA's :110
## awards_gold awards_silver awards_bronze awards_honorable_mentions
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000 1st Qu.:0.000
## Median :0.0000 Median :0.0000 Median :1.000 Median :1.000
## Mean :0.4706 Mean :0.9603 Mean :1.417 Mean :1.177
## 3rd Qu.:0.0000 3rd Qu.:2.0000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :6.0000 Max. :6.0000 Max. :6.000 Max. :6.000
## NA's :2 NA's :2 NA's :2 NA's :515
## leader deputy_leader
## Length:3780 Length:3780
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
The data consists of 3,780 observations and 18 variables, for a total of 68,040 data points.
In the International Mathematical Olympiad, a team currently has at most 6 members, although the summary above shows team sizes of up to 8 in this dataset. Some teams have both male and female contestants, while others have only male or only female contestants.
unique_countries <- unique(df$country)
num_unique_countries <- length(unique_countries)
num_unique_countries
## [1] 139
In total, 139 distinct country entries from 5 continents appear in the International Mathematical Olympiad data from 1959 to 2024.
First, the missing values are checked.
colSums(is.na(df))
## year country team_size_all
## 0 0 0
## team_size_male team_size_female p1
## 283 2180 110
## p2 p3 p4
## 110 110 110
## p5 p6 p7
## 110 110 3780
## awards_gold awards_silver awards_bronze
## 2 2 2
## awards_honorable_mentions leader deputy_leader
## 515 870 968
#the leader and deputy_leader columns are not used in the clustering, so they are removed. The p7 column is entirely empty, so it is dropped as well
df_cleaned <- df[, !colnames(df) %in% c("p7", "leader", "deputy_leader")]
#missing values in team_size_male/team_size_female are assumed to occur when a team is solely male or solely female, so they are replaced with 0
df_cleaned$team_size_male[is.na(df_cleaned$team_size_male)] <- 0
df_cleaned$team_size_female[is.na(df_cleaned$team_size_female)] <- 0
#check for missing data in p1-p6 (team points on each of the six problems)
df_cleaned[is.na(df_cleaned$p1) | is.na(df_cleaned$p2) | is.na(df_cleaned$p3) | is.na(df_cleaned$p4) | is.na(df_cleaned$p5) | is.na(df_cleaned$p6), ]
## # A tibble: 110 × 15
## year country team_size_all team_size_male team_size_female p1 p2 p3
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2010 Democr… 6 0 0 NA NA NA
## 2 1991 Democr… 6 0 0 NA NA NA
## 3 1983 Germany 6 6 0 NA NA NA
## 4 1983 United… 6 6 0 NA NA NA
## 5 1983 Hungary 6 6 0 NA NA NA
## 6 1983 Union … 6 6 0 NA NA NA
## 7 1983 Romania 6 6 0 NA NA NA
## 8 1983 Vietnam 6 6 0 NA NA NA
## 9 1983 Bulgar… 6 5 1 NA NA NA
## 10 1983 France 6 6 0 NA NA NA
## # ℹ 100 more rows
## # ℹ 7 more variables: p4 <dbl>, p5 <dbl>, p6 <dbl>, awards_gold <dbl>,
## # awards_silver <dbl>, awards_bronze <dbl>, awards_honorable_mentions <dbl>
#110 rows have missing values in p1 to p6, i.e. no per-problem scores are recorded for those country-years (the teams are listed, e.g. for 1983, but their scores are unavailable). These rows cannot be used for score-based clustering, so they are removed
df_cleaned <- df_cleaned[!apply(is.na(df_cleaned[, c("p1", "p2", "p3", "p4", "p5", "p6")]), 1, any), ]
#missing values in awards_honorable_mentions are filled with 0
df_cleaned$awards_honorable_mentions[is.na(df_cleaned$awards_honorable_mentions)] <- 0
df_cleaned
## # A tibble: 3,670 × 15
## year country team_size_all team_size_male team_size_female p1 p2 p3
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2024 United… 6 5 1 42 41 19
## 2 2024 People… 6 6 0 42 42 31
## 3 2024 Republ… 6 6 0 42 37 18
## 4 2024 India 6 6 0 42 34 11
## 5 2024 Belarus 6 6 0 42 30 10
## 6 2024 Singap… 6 6 0 42 37 7
## 7 2024 United… 6 6 0 42 33 8
## 8 2024 Hungary 6 6 0 42 37 16
## 9 2024 Poland 6 6 0 42 25 5
## 10 2024 Türkiye 6 5 1 38 37 5
## # ℹ 3,660 more rows
## # ℹ 7 more variables: p4 <dbl>, p5 <dbl>, p6 <dbl>, awards_gold <dbl>,
## # awards_silver <dbl>, awards_bronze <dbl>, awards_honorable_mentions <dbl>
To reveal each country's performance by decade in the IMO, I aggregate the information into a form that can produce meaningful clusters.
For each country and year, the average score over the 6 problems and the total number of awards (gold, silver and bronze) are calculated. These values, together with the honorable mentions, are then summarized by decade.
df_cleaned$total_awards <- df_cleaned$awards_gold + df_cleaned$awards_silver + df_cleaned$awards_bronze
df_cleaned$average_score <- rowMeans(df_cleaned[, c("p1", "p2", "p3", "p4", "p5", "p6")])
#assign each year to its decade (e.g. 1987 -> 1980)
df_cleaned$decade <- floor(df_cleaned$year / 10) * 10
head(df_cleaned)
## # A tibble: 6 × 18
## year country team_size_all team_size_male team_size_female p1 p2 p3
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2024 United … 6 5 1 42 41 19
## 2 2024 People'… 6 6 0 42 42 31
## 3 2024 Republi… 6 6 0 42 37 18
## 4 2024 India 6 6 0 42 34 11
## 5 2024 Belarus 6 6 0 42 30 10
## 6 2024 Singapo… 6 6 0 42 37 7
## # ℹ 10 more variables: p4 <dbl>, p5 <dbl>, p6 <dbl>, awards_gold <dbl>,
## # awards_silver <dbl>, awards_bronze <dbl>, awards_honorable_mentions <dbl>,
## # total_awards <dbl>, average_score <dbl>, decade <dbl>
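library(dplyr) #provides %>%, group_by() and summarise()
#aggregate the yearly results to one row per country and decade
pfm_aggregate <- df_cleaned %>%
  group_by(country, decade) %>%
  summarise(
    average_score = mean(average_score),
    total_awards = sum(total_awards),
    awards_honorable_mentions = sum(awards_honorable_mentions),
    .groups = "drop")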
pfm_aggregate
## # A tibble: 541 × 5
## country decade average_score total_awards awards_honorable_mentions
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Albania 1990 3.42 0 2
## 2 Albania 2000 6.63 7 12
## 3 Albania 2010 6.60 2 18
## 4 Albania 2020 7.07 2 13
## 5 Algeria 1980 6.83 3 1
## 6 Algeria 1990 2.54 0 1
## 7 Algeria 2000 0.333 0 0
## 8 Algeria 2010 7.83 4 11
## 9 Algeria 2020 9.4 4 11
## 10 Angola 2010 0 0 0
## # ℹ 531 more rows
For clustering, I use only the numerical columns and standardize them so that they are on the same scale.
cluster_data <- pfm_aggregate[3:5] #numerical columns only
scaled_data <- scale(cluster_data)
head(scaled_data)
## average_score total_awards awards_honorable_mentions
## [1,] -1.0959967 -1.0280125 -0.6707646
## [2,] -0.7592170 -0.6533173 0.6435550
## [3,] -0.7622707 -0.9209567 1.4321468
## [4,] -0.7138477 -0.9209567 0.7749870
## [5,] -0.7382774 -0.8674288 -0.8021965
## [6,] -1.1876078 -1.0280125 -0.8021965
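To make the standardization step concrete, the sketch below reproduces what `scale()` does by hand: each column is centered on its mean and divided by its standard deviation. The `toy` matrix is made up and only stands in for `cluster_data`.

```r
# Toy matrix standing in for cluster_data (values are made up)
toy <- cbind(average_score = c(3, 7, 11, 20),
             total_awards  = c(0, 2, 10, 40))

# Manual z-score: subtract the column mean, divide by the column sd
centered <- sweep(toy, 2, colMeans(toy))
manual   <- sweep(centered, 2, apply(toy, 2, sd), "/")

# Built-in equivalent
scaled <- scale(toy)

all.equal(manual, scaled, check.attributes = FALSE)
```

Without this step, a variable with a large range (such as total awards per decade) would dominate the Euclidean distances used by the clustering methods.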
The data points are then visualized on a 3D scatter plot.
Before applying any clustering method, I check the clustering tendency of the data by calculating the Hopkins statistic.
get_clust_tendency(scaled_data, n = nrow(scaled_data) - 1, graph = TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
## $hopkins_stat
## [1] 0.9028108
##
## $plot
The heat map shows clearly structured data with distinct lines and blocks. In addition, the Hopkins statistic is 0.903, which is close to 1. The dataset therefore shows a strong clustering tendency.
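The idea behind the Hopkins statistic can be sketched in base R. This is a simplified variant written for illustration, and it is an assumption that it matches the convention in which values near 1 indicate clustered data; it is not necessarily the exact formula used by `get_clust_tendency()`. The function name `hopkins_sketch` and the `blobs` data are hypothetical.

```r
# Simplified Hopkins-type statistic (illustrative, not the factoextra code)
hopkins_sketch <- function(x, m = max(1, floor(nrow(x) / 10)), seed = 123) {
  set.seed(seed)
  x <- as.matrix(x)
  d <- ncol(x)

  # Distance from point p to its nearest neighbour among the rows of pts
  nearest <- function(p, pts) min(sqrt(rowSums(sweep(pts, 2, p)^2)))

  # m points drawn uniformly from the bounding box of the data
  mins <- apply(x, 2, min)
  maxs <- apply(x, 2, max)
  u_pts <- matrix(sapply(seq_len(d), function(j) runif(m, mins[j], maxs[j])),
                  nrow = m)
  u <- apply(u_pts, 1, nearest, pts = x)

  # m real observations; nearest neighbour among the remaining observations
  idx <- sample(nrow(x), m)
  w <- sapply(idx, function(i) nearest(x[i, ], x[-i, , drop = FALSE]))

  # Near 0.5 for uniform data, near 1 for strongly clustered data
  sum(u) / (sum(u) + sum(w))
}

# Two tight, well-separated blobs (synthetic data)
set.seed(1)
blobs <- rbind(matrix(rnorm(100, 0, 0.1), ncol = 2),
               matrix(rnorm(100, 10, 0.1), ncol = 2))
h <- hopkins_sketch(blobs, m = 10)
h
```

In clustered data, real points sit much closer to their neighbours than uniformly placed points do, which pushes the ratio toward 1.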
optnb<-NbClust(scaled_data, distance="euclidean", min.nc=2, max.nc=10, method="complete", index="ch")
optnb
## $All.index
## 2 3 4 5 6 7 8 9
## 177.4881 261.5802 544.4936 555.5426 510.3006 464.4564 455.7308 487.0920
## 10
## 468.3016
##
## $Best.nc
## Number_clusters Value_Index
## 5.0000 555.5426
##
## $Best.partition
## [1] 1 2 2 2 1 1 1 2 2 1 1 3 4 2 2 3 2 2 3 3 4 4 5 3 5 3 3 2 2 2 1 2 2 2 1 1 2
## [38] 2 3 5 4 3 1 1 3 2 2 2 2 1 1 1 1 1 1 2 1 2 2 3 1 1 1 3 2 4 4 3 1 3 3 5 5 5
## [75] 4 3 1 1 1 1 1 3 4 4 5 3 1 1 2 1 3 3 4 2 2 3 1 2 2 3 2 4 3 1 3 2 1 1 1 1 1
## [112] 2 2 2 3 4 2 3 5 3 5 3 3 3 5 1 2 2 2 1 1 1 1 2 1 1 1 1 2 1 1 2 2 2 1 3 3 2
## [149] 2 2 2 3 3 3 4 4 4 3 1 2 4 2 3 5 5 5 3 3 5 5 4 4 3 1 1 1 3 2 2 2 3 1 1 1 1
## [186] 1 1 1 3 4 4 4 3 5 5 5 5 5 4 3 1 1 1 2 2 3 5 4 4 3 1 1 2 5 3 1 1 1 1 2 2 2
## [223] 3 5 5 5 3 3 3 4 4 4 3 3 1 4 4 4 3 1 1 1 5 5 5 3 2 4 4 3 1 1 2 2 1 1 1 1 1
## [260] 2 2 2 1 3 2 2 2 1 1 1 1 2 2 2 1 1 2 1 1 2 2 2 2 1 1 2 2 3 1 1 1 2 4 4 3 3
## [297] 1 3 2 4 2 3 1 1 1 3 2 2 2 2 1 1 1 1 1 1 3 3 2 2 4 3 1 3 2 2 2 1 2 1 1 2 1
## [334] 3 2 2 2 1 3 2 2 2 1 1 2 1 1 1 1 1 1 1 1 2 2 3 5 5 5 5 1 1 4 4 3 1 1 1 4 3
## [371] 3 3 3 4 4 4 3 1 1 2 2 2 1 1 1 1 3 5 5 5 5 1 4 2 2 5 5 5 5 5 5 3 5 5 5 3 1
## [408] 1 4 3 3 4 3 3 3 4 4 5 3 3 4 4 3 2 2 2 2 2 2 2 2 1 2 2 2 3 1 2 2 2 3 3 3 3
## [445] 2 2 2 1 2 2 3 1 2 2 5 5 5 3 1 2 2 1 1 1 2 4 5 3 1 2 1 1 1 1 1 2 2 1 1 2 2
## [482] 2 1 3 4 4 5 3 1 1 4 5 5 3 5 5 5 3 1 1 1 3 5 5 5 5 4 3 5 5 5 5 5 5 1 1 1 2
## [519] 2 1 2 2 2 1 1 1 1 1 3 5 5 5 5 3 3 3 3 4 3 1 1
Based on the Calinski-Harabasz index computed by NbClust with complete linkage, the suggested number of clusters for my data set is 5. However, this initial k should be verified with other methods, so I also check the elbow, silhouette and AIC criteria to see how many clusters they suggest.
opteb <- Optimal_Clusters_KMeans(scaled_data, max_clusters=10, plot_clusters = TRUE)
optsh <- Optimal_Clusters_KMeans(scaled_data, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
optaic <- Optimal_Clusters_KMeans(scaled_data, max_clusters=10, plot_clusters=TRUE, criterion="AIC")
These plots consistently point to 3 as the appropriate number of clusters: k = 3 achieves the highest silhouette score, and additional clusters capture little further structure in the data.
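The elbow and silhouette criteria can also be computed directly with `kmeans()` and `cluster::silhouette()`. The sketch below is an illustration on synthetic data (three made-up blobs standing in for `scaled_data`), not the `Optimal_Clusters_KMeans()` implementation.

```r
library(cluster)  # silhouette(); a recommended package shipped with R

set.seed(42)
# Synthetic stand-in for scaled_data: three well-separated blobs
blob_centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
toy <- do.call(rbind, lapply(1:3, function(i)
  cbind(rnorm(40, blob_centers[i, 1], 0.5),
        rnorm(40, blob_centers[i, 2], 0.5))))
d <- dist(toy)

ks  <- 2:8
wss <- numeric(length(ks))  # elbow criterion: total within-cluster SS
sil <- numeric(length(ks))  # average silhouette width
for (i in seq_along(ks)) {
  km <- kmeans(toy, centers = ks[i], nstart = 25)
  wss[i] <- km$tot.withinss
  sil[i] <- mean(silhouette(km$cluster, d)[, "sil_width"])
}

# k with the highest average silhouette width
best_k <- ks[which.max(sil)]
best_k
```

Plotting `wss` against `ks` gives the elbow curve, where the bend marks the point after which extra clusters reduce the within-cluster variation only slightly.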
Since my data set is not large, I would like to compare how the data are partitioned by 3 different clustering methods: K-means, PAM and hierarchical clustering.
For the agglomerative method, I first compute the agglomerative coefficient for 4 linkage methods: “average”, “single”, “complete” and “ward”.
## single average complete ward
## 0.7889226 0.9548715 0.9784068 0.9965703
Ward has the highest coefficient of all the methods, so I visualize it as a dendrogram.
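A comparison of this kind can be sketched with `cluster::agnes()`, whose result stores the agglomerative coefficient in `$ac`. The data below are synthetic and only stand in for `scaled_data`; the original code for this step is not shown, so this is one plausible way to obtain such a table.

```r
library(cluster)  # agnes()

set.seed(42)
# Synthetic stand-in for scaled_data: two loose groups in 2D
toy <- rbind(matrix(rnorm(60, 0), ncol = 2),
             matrix(rnorm(60, 4), ncol = 2))

methods <- c("single", "average", "complete", "ward")

# Agglomerative coefficient for each linkage method
ac <- sapply(methods, function(m) agnes(toy, method = m)$ac)
round(ac, 3)
```

The coefficient tends to grow with the number of observations, so it is better suited to comparing linkage methods on the same data than to judging clustering quality in absolute terms.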
For the divisive method, a dendrogram is plotted with Diana.
For each dendrogram, a scatter plot is used to visualize the clusters.
Because these methods form clusters in different ways, the final clusters differ slightly. In the next part, I compute the silhouette score for all methods to see which one performs best.
## K-means average silhouette width: 0.4907977
## PAM average silhouette width: 0.4819316
## Ward average silhouette width: 0.4697884
## Diana average silhouette width: 0.4934603
The per-cluster silhouette summaries below are listed in the same order: K-means, PAM, Ward and Diana.
## cluster size ave.sil.width
## 1 1 188 0.48
## 2 2 143 0.40
## 3 3 210 0.56
## cluster size ave.sil.width
## 1 1 183 0.62
## 2 2 161 0.36
## 3 3 197 0.46
## cluster size ave.sil.width
## 1 1 203 0.59
## 2 2 108 0.48
## 3 3 230 0.36
## cluster size ave.sil.width
## 1 1 216 0.56
## 2 2 125 0.45
## 3 3 200 0.45
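The comparison above can be reproduced along the following lines. This is a sketch on synthetic data with k = 3, using `kmeans()` and the `cluster` package; the object names (`toy`, `labels`, `avg_sil`) are illustrative and the original code for this step is not shown.

```r
library(cluster)  # pam(), agnes(), diana(), silhouette()

set.seed(42)
# Synthetic stand-in for scaled_data: three well-separated blobs
blob_centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
toy <- do.call(rbind, lapply(1:3, function(i)
  cbind(rnorm(40, blob_centers[i, 1], 0.5),
        rnorm(40, blob_centers[i, 2], 0.5))))
d <- dist(toy)
k <- 3

# Cluster labels from the four methods
labels <- list(
  kmeans = kmeans(toy, centers = k, nstart = 25)$cluster,
  pam    = pam(toy, k = k)$clustering,
  ward   = cutree(as.hclust(agnes(toy, method = "ward")), k = k),
  diana  = cutree(as.hclust(diana(toy)), k = k)
)

# Average silhouette width per method
avg_sil <- sapply(labels, function(cl) mean(silhouette(cl, d)[, "sil_width"]))
round(avg_sil, 3)
```

Using the same distance object `d` for every method keeps the silhouette widths comparable across methods.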
The silhouette scores for K-means, PAM and Diana are relatively similar. The Ward method has the lowest average silhouette width, implying less effective clustering than the other methods.
While Diana shows slightly better clustering quality than the rest (0.493 versus 0.491 for K-means), the difference is minimal. Given the efficiency and scalability of K-means, it is the preferred choice for further clustering and analysis, as it offers a balance between quality and practicality.
NbClust above recommended splitting my data set into 5 clusters. I therefore compute the Calinski-Harabasz index to compare the clustering quality for 3 and 5 clusters.
## Calinski-Harabasz index for 3 clusters: 665.4
## Calinski-Harabasz index for 5 clusters: 598.23
For the Calinski-Harabasz index, higher values indicate better-defined clusters. Therefore, 3 clusters (665.4) give a better partition than 5 clusters (598.23).
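For reference, the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom: CH = [BSS / (k - 1)] / [WSS / (n - k)]. The sketch below computes it from a `kmeans()` result on synthetic data; the helper `ch_index` is a hypothetical name and the original code for this step is not shown.

```r
# Calinski-Harabasz index from a kmeans() fit with n observations
ch_index <- function(km, n) {
  k <- nrow(km$centers)
  (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
}

set.seed(42)
# Synthetic stand-in for scaled_data: three well-separated blobs
blob_centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
toy <- do.call(rbind, lapply(1:3, function(i)
  cbind(rnorm(40, blob_centers[i, 1], 0.5),
        rnorm(40, blob_centers[i, 2], 0.5))))

ch3 <- ch_index(kmeans(toy, centers = 3, nstart = 25), nrow(toy))
ch5 <- ch_index(kmeans(toy, centers = 5, nstart = 25), nrow(toy))
c(k3 = ch3, k5 = ch5)
```

Because the index penalizes extra clusters through the (k - 1) and (n - k) terms, splitting an already compact group usually lowers it.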
In addition, the shadow statistic is used to evaluate clustering quality. It is closely related to the silhouette score but is based on the distance of each data point to its own cluster centroid and to the second-closest centroid, which gives additional insight into cluster cohesion and separation.
## 1 2 3
## 0.5470393 0.5913554 0.4825425
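In contrast to silhouette values, which measure how well separated the clusters are, shadow values describe how the points in a cluster are arranged relative to their own centroid and the second-closest one. A shadow value near 0 means that a point lies close to its own centroid, while a value near 1 means that it is almost equidistant from two centroids, so lower averages are better. All three clusters average roughly 0.5. Cluster 3 (0.48) has the lowest value, so its points sit closest to their own centroid relative to the neighbouring one, whereas cluster 2 (0.59) has the highest value, meaning its points lie closer to the boundary with a neighbouring cluster.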
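The definition of the shadow value can be written out in a few lines of base R: for each point, twice the distance to the closest centroid divided by the sum of the distances to the closest and second-closest centroids. The sketch below applies it to a `kmeans()` fit on synthetic data; `shadow_values` is a hypothetical helper, not the package function that produced the output above.

```r
# Shadow value per observation, given a matrix of cluster centroids
shadow_values <- function(x, centers) {
  x <- as.matrix(x)
  # n x k matrix of Euclidean distances from each point to each centroid
  dmat <- sapply(seq_len(nrow(centers)), function(j)
    sqrt(rowSums(sweep(x, 2, centers[j, ])^2)))
  # Sort each row so columns 1 and 2 hold the two smallest distances
  sorted <- t(apply(dmat, 1, sort))
  2 * sorted[, 1] / (sorted[, 1] + sorted[, 2])
}

set.seed(42)
# Synthetic stand-in for scaled_data: three well-separated blobs
blob_centers <- rbind(c(0, 0), c(6, 0), c(3, 6))
toy <- do.call(rbind, lapply(1:3, function(i)
  cbind(rnorm(40, blob_centers[i, 1], 0.5),
        rnorm(40, blob_centers[i, 2], 0.5))))

km <- kmeans(toy, centers = 3, nstart = 25)
sv <- shadow_values(toy, km$centers)

# Mean shadow value per cluster (lower = better separated)
cluster_means <- tapply(sv, km$cluster, mean)
cluster_means
```

On well-separated synthetic blobs the cluster means are small, which illustrates why averages around 0.5 point to noticeable overlap between neighbouring clusters.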
From the plot, three clusters can be recognized. Cluster 1 (red) contains the countries with both high scores and a high number of awards. However, they have few honorable mentions relative to their overall performance, which is expected because an honorable mention is only given to contestants who do not win a medal.
Cluster 2 (yellow) maintains average scores and a reasonable number of total awards. Interestingly, despite these average scores, this cluster is characterized by a relatively high number of honorable mentions.
The third cluster (green) includes countries with the lowest performance in terms of scores and total awards, accompanied by few honorable mentions.
These cluster insights can be valuable for understanding the strengths and weaknesses of the Olympiad teams of different countries. Countries that fall into cluster 3 could benefit from increased support and resources to improve their performance in future competitions.
To give a more vivid picture of the performance of the countries that participated in the IMO, I plotted a map of global participation trends. It shows how countries were grouped into clusters (1, 2 or 3) across the decades. NA values mark countries that have never participated in the IMO.
In the 1960s, a relatively small number of countries participated, mostly the Soviet Union and Eastern European countries. As the number of participants in the Mathematical Olympiad grew, cluster membership varied noticeably from decade to decade. Today, there are participants from 5 continents.
Cluster 1 was dominant in earlier decades, representing countries that excelled in terms of both awards and scores. Most of the countries in this cluster are from Europe, America and Australia, and they are typically developed nations with long-standing traditions in competitive mathematics.
Cluster 2 exhibits moderate performance across decades. Its geographical distribution is more fragmented, suggesting that moderate performers are spread across different regions without a strong concentration.
Countries in cluster 3 are predominantly from Africa and other regions with relatively lower scores and fewer awards. A notable shift is observed between the 1980s and 1990s, when some participants from South America improved their performance and moved from cluster 3 to cluster 2. This suggests that mathematics training and education developed in those countries during that time.
It is also worth mentioning that even though China and Australia joined the contest relatively late, in the 1980s, they have maintained consistent performance over the decades. Their strong and stable results are reflected in their consistent appearance in cluster 1, which represents countries with strong performance in terms of scores and awards.
In conclusion, the earlier decades are dominated by a few high-performing nations (Hungary, Romania, Russia), while later decades show increased diversity, with more countries falling into the moderate- or low-performance clusters.
Source: the official International Mathematical Olympiad website, https://www.imo-official.org